Words and Word Usage: Newspaper Text versus the Web

Authors

  • Vinci Liu
  • James R. Curran
Abstract

This paper explores the differences in words and word usage in two corpora: one derived from newspaper text and the other from the web. A corpus of web pages is compiled from a controlled traversal of the web, producing a topic-diverse collection of 2 billion words of web text. We compare this Web Corpus with the Gigaword Corpus, a 2 billion word collection of news articles. The Web Corpus is applied to the task of automatic thesaurus extraction, obtaining overall results similar to those obtained with the Gigaword Corpus. The quality of the synonyms extracted for each target word depends on the word's usage in the corpus. With many more words available on the web, a much larger Web Corpus can be created to obtain better results in different NLP tasks.

Similar resources

A Method for Keyword Extraction and Word Weighting to Improve Persian Text Classification

Due to the ever-increasing expansion of information and the huge amount of unstructured documents, keywords play a very important role in information retrieval. Because manual extraction of keywords faces various challenges, their automated extraction seems inevitable. This research attempts to use a thesaurus (a structured word-net) to extract them automatically. A...

Language Models

A language model assigns a probability to a piece of unseen text, based on some training data. For example, a language model based on a big English newspaper archive is expected to assign a higher probability to “a bit of text” than to “aw pit tov tags”, because the words in the former phrase (or word pairs or word triples if so-called n-gram models are used) occur more frequently in the data ...

EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...

Construction and Analysis of a Large Vietnamese Text Corpus

This paper presents a new Vietnamese text corpus which contains around 4.05 billion words. It is a collection of Wikipedia texts, newspaper articles and random web texts. The paper describes the process of collecting, cleaning and creating the corpus. Processing Vietnamese texts posed several challenges; for example, unlike many Latin languages, Vietnamese does not use blanks f...

Publication date: 2005